Summary

Problem 

Conventional  methods  for  scoring  aptitude  and  achievement  tests  that  are  used  in  selecting, 
classifying,  and  training  military  personnel  discard  useful  information  about  an  examinee's  ability/ 
skill  level.  Information  is  lost  whenever  the  original  responses  to  test  questions  are  classified  only 
as  “right”  or  “wrong.”  Additional  information  can  be  obtained  by  considering  the  difficulty  level 
of  the  questions  answered  correctly  and  by  taking  into  account  which  particular  wrong  answers 
were  selected. 

Objective 

The  objective  of  this  effort  was  to  develop  new  procedures  for  scoring  aptitude  and 
achievement  tests  that  will  increase  the  reliability  and  validity  of  those  tests. 

Approach 

A new approach to scoring multiple-choice items was developed. The procedure is not based
on  Item  Response  Theory  (IRT),  and  does  not  require  any  assumptions  regarding  “latent”  abilities, 
the  dimensionality  of  the  set  of  items  analyzed,  or  the  mathematical  form  of  the  regression  of  item 
responses  on  unobservable  variables.  The  procedure  does  assume  that  the  individuals  included  in 
an  item  analysis  are  randomly  sampled  from  the  examinee  population  of  interest.  The  procedure  is 
characterized  as  “linear”  because  each  examinee's  score  is  a  linear  function  of  category  scoring 
weights  and  category-response  indicators. 

The  new  scoring  procedure  is  called  polyweighting.  In  this  procedure,  the  scoring  weights 
obtained  for  an  item  are  independent  of  the  difficulty  of  other  items  included  in  the  item  analysis 
and  the  weights  are  bounded  so  that  examinees  who  give  the  correct  answer  to  an  item  will  always 
receive  the  most  credit.  For  each  correct  answer,  and  each  wrong  answer  selected  by  100  or  more 
examinees,  the  category  scoring  weight  is  approximately  equal  to  the  mean  percentile  rank  among 
examinees  selecting  the  category.  For  each  wrong  answer  selected  by  fewer  than  100  examinees, 
the  scoring  weight  for  the  category  is  “regressed”  toward  the  mean  percentile  rank  among 
examinees  who  chose  any  wrong  answer  on  the  item. 

Results 

A  detailed  example  of  an  item  analysis  using  the  computer  program  “POLY”  is  presented, 
demonstrating  several  features  of  the  scoring  procedure.  In  particular,  the  example  shows  that 
polyweighting  does  not  require  a  fully-crossed  data  matrix  (one  in  which  all  examinees  have  been 
administered all questions) and that polyweighting increases coefficient-α for the set of items
analyzed. 

Conclusions 

The  scoring  procedure  described  in  this  report  provides  an  improved  foundation  for  scoring 
aptitude and achievement tests. It makes few assumptions about the available data and can be
implemented with smaller sample sizes than are required for IRT scoring. Users of the procedure
can  elect  either  to  keep  tests  at  their  current  length  and  increase  score  reliability,  or  to  reduce  test 
length  to  save  testing  time  while  maintaining  reliabilities  at  current  levels. 

Recommendations 

Organizations  that  administer  aptitude  and/or  achievement  tests  for  purposes  of  personnel 
selection,  classification,  or  training  should  consider  whether  this  new  scoring  procedure  can  be 
usefully  applied  to  their  tests. 

Introduction


Polychotomous  scoring  of  test  items,  while  not  widely  practiced,  has  a  lengthy  history. 
Haladyna  and  Sympson  (1988)  reviewed  that  history  and  distinguished  between  two  approaches  to 
polychotomous  scoring.  One  approach  involves  the  assignment  of  differential  scoring  weights  to 
item  response  categories.  In  this  approach  to  polychotomous  scoring,  the  test  score  is  a  linear 
function  of  the  examinee's  item  response  vector. 

One  method  of  linear  polychotomous  scoring  has  the  unique  property  of  maximizing 
coefficient-α (Cronbach, 1951) for the set of items calibrated (Guttman, 1941; Lord, 1958). This
scoring procedure has been referred to by various names, including reciprocal averages scaling
(Horst, 1935), optimal scaling (Bock, 1960), and dual scaling (Nishisato, 1980). Since these names
are not suggestive of the method's primary distinguishing characteristic, the present author refers
to it as “max-alpha” (MA) scaling.

MA  scaling  has  two  drawbacks  as  an  approach  to  polychotomous  item  scoring.  First,  the 
scoring  weights  that  are  derived  for  an  item  depend  on  the  difficulty  level  of  the  other  items  that 
are  calibrated  at  the  same  time.  If  an  item  is  calibrated  along  with  a  set  of  easy  items,  the  obtained 
scoring  weights  will  be  different  than  if  the  item  were  calibrated  along  with  a  set  of  difficult  items. 
Second, in order to maximize α, the MA method often assigns weights to wrong answers that
exceed  the  weight  assigned  to  the  correct  response. 

The second approach to polychotomous item scoring discussed by Haladyna and Sympson (1988)
has  a  shorter  history.  This  approach  derives  from  Item  Response  Theory  (IRT).  IRT  models  for 
polychotomous  calibration  of  multiple-choice  items  have  been  introduced  within  the  past  two 
decades (Bock, 1972; Samejima, 1979; Sympson, 1981, 1983, 1993; Thissen & Steinberg, 1984).
In  this  approach  to  polychotomous  item  scoring,  the  test  score  is  a  nonlinear  function  of  the 
examinee's  item  response  vector. 

If  the  set  of  items  calibrated  with  a  polychotomous  IRT  model  is  unidimensional,  and  if  the 
chosen  model  fits  the  items,  the  model  parameters  for  any  one  item  will  be  independent  of  the 
parameters  obtained  for  other  items.  However,  if  the  assumed  model  is  not  correct,  IRT  item 
parameters  are  dependent  on  both  the  examinee  population  that  is  sampled  and  the  set  of  items  that 
is calibrated. IRT calibration methods require fairly large samples (N > 1000 per item) in order to
provide  stable  results. 

This  report  introduces  a  new  approach  to  linear  polychotomous  scoring  of  test  items.  The 
approach  is  similar  to  MA  scaling  in  some  regards,  but  provides  scoring  weights  for  a  given  item 
that  are  independent  of  the  difficulty  of  other  items  in  the  analysis.  Moreover,  the  scoring  weights 
are bounded so that an examinee can never receive more credit for an incorrect response than for
a  correct  response. 

Approach 


Computing Polyweights

Polyweighting  is  a  scoring  procedure  that  uses  a  different  scoring  weight  for  each  item  response 
category.  An  examinee's  polyscore  is  equal  to  the  mean  of  the  scoring  weights  of  the  categories 
chosen by the examinee. Polyweighting does not require the assumptions of IRT, and can be
applied  with  smaller  samples  than  are  commonly  required  with  IRT  models.  Polyweighting  does 
require  that  item  calibration  be  carried  out  with  a  random  sample  of  examinees  from  the  population 
of  interest. 

Unlike  some  scoring  methods,  polyweighting  gives  the  examinee  more  credit  for  correct 
answers  to  difficult  questions  and  less  credit  for  correct  answers  to  easy  questions.  Conversely, 
polyweighting penalizes the examinee more heavily for wrong answers to easy questions than for
wrong answers to difficult questions. This may be contrasted with number/proportion-correct (PC)
scoring and with scoring under the 1-parameter and 2-parameter logistic IRT models. The latter
scoring methods assign scores to examinees in a manner that renders the scores independent of the
difficulty of the questions answered correctly or incorrectly (Birnbaum, 1968, p. 458).

In  polyweighting,  the  scoring  weights  assigned  to  item-response  categories  are  referred  to  as 
polyweights.  An  iterative  procedure  must  be  used  to  derive  polyweights  for  a  set  of  items.  The 
procedure  is  as  follows: 

1.  Each  examinee  in  the  calibration  sample  is  assigned  a  provisional  score  equal  to  the 
examinee's  proportion  correct  among  items  the  examinee  was  administered.  It  is  assumed  that 
different  examinees  may  have  been  administered  different  items  during  data  collection,  but  that  an 
adequate  number  of  examinees  (e.g.,  100  or  more)  was  administered  each  “set”  of  items.  It  is  also 
assumed  that  item-sets  were  assigned  to  examinees  randomly,  so  that  each  “item-set  group”  is 
randomly  equivalent  to  other  examinee  groups. 

2.  Since  PC  scores  for  examinees  who  are  administered  different  item-sets  are  not  directly 
comparable  (due  to  variation  in  difficulties  and  other  characteristics  of  the  items  administered), 
each  examinee's  PC  score  is  converted  to  a  percentile  rank  relative  to  those  examinees  who  were 
administered  the  same  item-set.  This  is  equivalent  to  an  equipercentile  equating  of  PC  scores  from 
different  item-sets  (Angoff,  1971,  p.  563).  For  each  examinee,  his/her  percentile  rank  is  the 
proportion  of  examinees,  among  those  who  were  administered  the  same  item-set,  who  obtained  a 
PC  score  that  was  less  than  or  equal  to  the  PC  score  obtained  by  the  given  examinee,  multiplied  by 
100. 

3.  For  each  item,  the  mean  percentile  rank  among  examinees  who  chose  each  possible 
response  category  is  determined.  This  computation  includes  all  examinees  who  were  administered 
a  given  item,  even  if  they  were  administered  different  item-sets.  At  this  point,  if  the  mean  percentile 
rank  among  examinees  who  chose  the  correct  answer  for  a  given  item  is  less  than  the  mean 
percentile  rank  among  all  examinees  who  were  administered  the  item,  the  item  is  deleted  from  the 
analysis.  This  is  equivalent  to  deleting  an  item  if  the  point-biserial  correlation  (Henrysson,  1971, 
p.  142)  between  the  correct  answer  and  examinee  percentile  ranks  becomes  negative. 

4.  For  all  items  and  all  response  categories,  provisional  polyweights  are  computed  as  follows: 

a. For each correct answer, the provisional polyweight is equal to the mean percentile rank
among examinees choosing the category, rounded to the nearest integer.

b. For each wrong answer chosen by 100 or more examinees, the provisional polyweight
is equal to the mean percentile rank among examinees choosing the category, rounded to the
nearest integer.

c.  For  each  wrong  answer  chosen  by  fewer  than  100  examinees,  the  provisional 
polyweight  is  a  rounded  linear  combination  of  the  mean  percentile  rank  among  examinees 
choosing  the  category  and  the  mean  percentile  rank  among  examinees  choosing  any  wrong  answer 
on the item. For these categories, the polyweight for category j of item i is equal to

$$W_{ij} = \frac{N_{ij}}{100}\,R_{ij} + \left(1 - \frac{N_{ij}}{100}\right) R_{iw} \tag{1}$$

rounded to the nearest integer. In Equation 1, $R_{iw}$ is the mean percentile rank among examinees
choosing any wrong answer on item i, $R_{ij}$ is the mean percentile rank among examinees choosing
category j, and $N_{ij}$ is the number of examinees choosing category j.

5. Since examinee percentile ranks range from a minimum possible value of 100(1/N) to a
maximum possible value of 100, the provisional polyweights can assume any integer value from 0
to 100. For a given item, if the provisional polyweight for an incorrect response is found to equal
or  exceed  the  provisional  polyweight  for  the  correct  response,  the  polyweight  for  the  incorrect 
response  is  set  equal  to  1  less  than  the  polyweight  for  the  correct  response.  Thus,  under 
polyweighting,  an  examinee  can  never  receive  more  credit  for  an  incorrect  answer  than  for  a 
correct  answer. 

6.  Given  the  provisional  polyweights  for  all  response  categories,  provisional  examinee 
polyscores  are  computed.  As  stated  earlier,  an  examinee's  polyscore  is  equal  to  the  mean  of  the 
polyweights  of  the  categories  chosen  by  the  examinee.  Since  polyscores,  like  all  raw  test  scores, 
are  not  comparable  between  examinees  who  have  taken  different  item-sets,  the  provisional 
polyscores  are  converted  to  percentile  ranks  within  each  group  of  examinees  who  have  been 
administered  the  same  set  of  items. 

7.  Given  the  new  percentile  ranks  for  all  examinees,  the  iterative  procedure  returns  to  Step  3, 
above. Steps 3 through 6 are repeated until the mean squared correlation ratio between items and
percentile  ranks  stops  increasing. 
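
The steps above can be sketched in code. The following is a minimal illustration of one pass through Steps 2 to 6, not the POLY program itself. All function names are hypothetical. The shrinkage used for rarely chosen wrong answers (a share of N/100 on the category mean and the remainder on the wrong-answer mean) is an assumed form of Equation 1 that is consistent with the verbal description, and rounding is taken to be round-half-up.

```python
import math
from collections import defaultdict


def nearest(x):
    """Round half up (Python's built-in round() rounds halves to even)."""
    return int(math.floor(x + 0.5))


def mean(xs):
    return sum(xs) / len(xs)


def percentile_ranks(scores, groups):
    """Steps 2 and 6: percentile rank of each score within its item-set group.

    The rank is 100 times the proportion of the group scoring <= this score.
    """
    by_group = defaultdict(list)
    for s, g in zip(scores, groups):
        by_group[g].append(s)
    return [100.0 * sum(1 for p in by_group[g] if p <= s) / len(by_group[g])
            for s, g in zip(scores, groups)]


def provisional_weights(choices, ranks, key, min_n=100):
    """Steps 3 to 5 for one item.

    choices: response category per examinee (0 = not administered)
    ranks:   current percentile rank per examinee
    key:     the keyed (correct) category
    Returns {category: integer weight}, or None if the item is deleted.
    """
    by_cat = defaultdict(list)
    for c, r in zip(choices, ranks):
        if c != 0:
            by_cat[c].append(r)
    everyone = [r for rs in by_cat.values() for r in rs]
    right = by_cat.get(key, [])
    wrong = [r for c, rs in by_cat.items() if c != key for r in rs]
    # Step 3: drop the item if the keyed-answer mean is below the overall mean.
    if not right or mean(right) < mean(everyone):
        return None
    weights = {key: nearest(mean(right))}
    for c, rs in by_cat.items():
        if c == key:
            continue
        w = mean(rs)
        if len(rs) < min_n:
            # ASSUMPTION: linear shrinkage toward the wrong-answer mean.
            share = len(rs) / min_n
            w = share * w + (1.0 - share) * mean(wrong)
        # Step 5: a wrong answer always earns less than the keyed answer.
        weights[c] = min(nearest(w), weights[key] - 1)
    return weights


def polyscores(response_matrix, item_weights):
    """Step 6: mean weight of the categories each examinee chose."""
    scores = []
    for row in response_matrix:
        ws = [item_weights[i][c] for i, c in enumerate(row)
              if c != 0 and item_weights[i] is not None]
        scores.append(sum(ws) / len(ws) if ws else float("nan"))
    return scores
```

A full calibration would call `percentile_ranks` on proportion-correct scores, then alternate `provisional_weights`, `polyscores`, and `percentile_ranks` until the mean squared correlation ratio stops increasing.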


Example  and  Discussion 

Output  From  the  Computer  Program  POLY 

Figures 1 through 3 show selected portions of a “Primary Output” file generated by the
computer program POLY. In this example, polyweights were derived for 467 items that had been
administered  to  8,141  applicants  for  military  service.  Each  applicant  was  administered  one  of  three 
86-item  vocabulary  tests  and  one  of  six  35-item  vocabulary  tests.  Thus,  there  were  18  different 
item-sets  administered  to  the  examinees,  with  121  items  in  each  item-set. 

WORD KNOWLEDGE JOINT CALIBRATION, S = 8,141

MEAN SQUARED ETA(I,%) HAS CONVERGED FOR 467 ITEMS.
(DELTA .LE. ZERO)


Figure 1. Convergence data.


WORD KNOWLEDGE JOINT CALIBRATION, S = 8,141

THE FOLLOWING 1 ITEM(S) WERE NOT SCORED BECAUSE
THE POINT-BISERIAL CORRELATION FOR THE KEYED
RESPONSE BECAME NEGATIVE:

CHECK THE ANSWER KEY AND/OR THE ITEM(S).


Figure 2. Diagnostic information from the Primary Output File generated by POLY.


WORD KNOWLEDGE JOINT CALIBRATION, S = 8,141

Figure 3. Two examples of the summary item analysis.

Figure 1 shows “Convergence Data” from the example Primary Output file. Column 1 in
Figure 1 gives iteration numbers. Iteration 0 is the iteration in which each examinee is assigned a
raw score equal to his/her proportion correct among the items the examinee was administered.
Subsequent iterations use provisional polyweights to compute examinee scores.

Column  2  in  Figure  1  indicates  how  many  items  were  included  in  the  analysis  during  each 
iteration.  In  this  example,  one  item  was  deleted,  starting  in  iteration  2,  because  the  point-biserial 
correlation  between  the  item's  correct  response  and  percentile  rank  scores  became  negative. 

Column  3  in  Figure  1  gives  the  mean,  over  all  retained  items,  of  the  squared  correlation  ratio 
between  an  item  and  percentile  rank  scores.  In  iteration  0,  the  value  reported  is  the  mean  squared 
point-biserial  correlation  between  correct  responses  and  percentile  ranks.  For  a  given  item,  the 
squared  point-biserial  correlation  for  the  correct  response  is  equal  to  the  proportion  of  variance  in 
percentile  ranks  that  is  accounted  for  by  knowing  whether  each  examinee  has  selected  the  correct 
response.  The  point-biserial  correlation  is  a  widely-used  index  of  item  discriminating  power  when 
item  scoring  is  dichotomous. 

In subsequent iterations, the value reported in column 3 is the mean squared η coefficient
between an item and percentile ranks (Lord & Novick, 1968, p. 263). The squared η coefficient
between an item and percentile rank scores indicates the proportion of variance in percentile ranks
that is accounted for by knowing which particular response category an examinee has selected. The
η coefficient for an item can never be smaller than the correct-answer point-biserial correlation. If
there is any variation among the score means for the wrong-answer categories, the η coefficient for
an item will be larger than the point-biserial correlation.
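
The relationship between the two indices can be checked numerically. The sketch below computes the squared correlation ratio as between-category variance over total variance, and obtains the squared point-biserial by collapsing the same responses to right/wrong. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def eta_squared(categories, ranks):
    """Proportion of variance in ranks explained by category membership."""
    grand = sum(ranks) / len(ranks)
    total_ss = sum((r - grand) ** 2 for r in ranks)
    groups = defaultdict(list)
    for c, r in zip(categories, ranks):
        groups[c].append(r)
    between_ss = sum(len(rs) * (sum(rs) / len(rs) - grand) ** 2
                     for rs in groups.values())
    return between_ss / total_ss


def point_biserial_squared(categories, ranks, key):
    """Squared point-biserial: eta squared after collapsing to right/wrong."""
    return eta_squared([c == key for c in categories], ranks)


# Wrong-answer categories 2 and 3 have different mean ranks here,
# so the full-category index exceeds the right/wrong index.
cats = [1, 1, 2, 2, 3, 3]
ranks = [90, 70, 40, 20, 10, 30]
print(eta_squared(cats, ranks) > point_biserial_squared(cats, ranks, key=1))
```

When every wrong-answer category has the same mean rank, the two quantities coincide, which is the boundary case described in the paragraph above.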

Column 4 in Figure 1 shows the change (δ) in the mean squared correlation ratio between
iterations. This quantity serves as the convergence criterion in POLY runs. When δ becomes so
small that it cannot be distinguished from zero, or if δ becomes negative, the iterations are
terminated. 

Column 5 in Figure 1 gives the mean value of coefficient-α in the analysis. Since there were
18 item-sets in this example, each value reported in column 5 is the mean of 18 α coefficients. As
the example shows, polyweighting increased the mean value of α for the item-sets analyzed. The
fact that scores based on polyweights have higher α coefficients than do number/proportion-correct
scores  implies  that  test  scores  based  on  polyweighting  will  correlate  more  highly  with  domain 
scores  and  will  have  higher  alternate-form  reliabilities. 
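
For reference, coefficient-α for a single item-set can be computed from an examinee-by-item score matrix. The sketch below uses the standard formula (Cronbach, 1951) and assumes every examinee in the matrix answered every item in the set. The function name is hypothetical. Under PC scoring the entries would be 0/1, and under polyweighting they would be the polyweights of the chosen categories.

```python
def coefficient_alpha(item_scores):
    """Cronbach's alpha for a complete examinee-by-item score matrix.

    item_scores: list of rows, one row per examinee, one column per item.
    """
    k = len(item_scores[0])

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_var_sum = sum(variance([row[j] for row in item_scores])
                       for j in range(k))
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)


# Toy 0/1 data: four examinees, three items.
print(coefficient_alpha([[1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]))
```

In an analysis with several item-sets, as in the example, one such value would be computed per item-set and the values averaged.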

Column  6  in  Figure  1  gives  an  index  of  “relative  information.”  This  index  is  based  on  the 
Spearman-Brown  formula  (Lord  &  Novick,  1968,  p.  112).  The  Spearman-Brown  formula  gives  the 
reliability  of  a  lengthened  test  as  a  function  of  the  initial  reliability  of  the  test  and  the  proportionate 
increase  in  test  length  that  is  anticipated.  However,  rather  than  use  the  Spearman-Brown  formula 
to  predict  reliability,  one  can  rearrange  the  formula  and  use  it  to  determine  how  much  a  given  test 
would have to be increased in length in order to obtain a specified level of reliability (Nishisato,
1980,  p.  118). 

In POLY, the relative information index is set equal to 1.0000 in iteration 0. Subsequent to
iteration 0, the formula used for computing relative information is

$$\frac{\alpha_p\,(1 - \alpha_c)}{\alpha_c\,(1 - \alpha_p)} \tag{2}$$

where $\alpha_c$ is the mean value of coefficient-α obtained under PC scoring (iteration 0) and $\alpha_p$ is
the mean value of coefficient-α obtained under polyweighting. This information index indicates
the proportionate increase in test length that would be required in order to achieve the same
reliability under PC scoring that has been achieved using polyweighting.

In  the  example  shown  in  Figure  1,  the  POLY  run  terminated  after  iteration  6.  At  that  time,  the 
mean value of coefficient-α was .96017. When this value is substituted for $\alpha_p$ in Equation 2, and
the initial mean α of .94994 is substituted for $\alpha_c$, the obtained final value of the relative information
index  is  1.2702.  This  indicates  that  a  typical  item-set  in  this  analysis  would  have  to  be  increased  in 
length  by  27%  (i.e.,  from  121  items  to  154  items)  in  order  to  achieve  the  level  of  reliability  under 
PC  scoring  that  was  achieved  using  polyweighting. 
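
The arithmetic in this example can be reproduced directly. The sketch below evaluates the relative information index from the two mean α values quoted above and confirms, through the Spearman-Brown formula, that lengthening the PC-scored test by that factor yields the polyweighting reliability. The function names are hypothetical. Because the inputs are the rounded α values printed in the report, the result agrees with the reported 1.2702 only to about three decimal places.

```python
def relative_information(alpha_pc, alpha_poly):
    """Length factor by which a PC-scored test must grow to match alpha_poly."""
    return (alpha_poly * (1.0 - alpha_pc)) / (alpha_pc * (1.0 - alpha_poly))


def spearman_brown(alpha, k):
    """Reliability of a test lengthened by a factor of k."""
    return k * alpha / (1.0 + (k - 1.0) * alpha)


ri = relative_information(0.94994, 0.96017)
print(round(ri, 2))               # approximately 1.27
print(round(121 * ri))            # items needed under PC scoring
print(spearman_brown(0.94994, ri))  # recovers the polyweighting alpha
```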

Figure  2  shows  diagnostic  information  from  the  Primary  Output  file  generated  by  POLY.  In  the 
example, Item 199 was deleted (not scored) starting in iteration 2, because the point-biserial
correlation  between  the  item's  correct  response  and  percentile  rank  scores  was  negative  at  the  end 
of  iteration  1.  Items  53,  66,  99,  180,  and  223  were  scored,  but  have  been  flagged  for  special 
attention  because  each  of  these  items  had  at  least  one  incorrect  answer  with  a  positive  point-biserial 
correlation  that  was  larger  than  the  point-biserial  correlation  for  the  correct  answer. 

Figure  3  shows  two  examples  of  the  “Summary  Item  Analysis”  provided  for  each  item  in  the 
Primary  Output  file.  Items  65  and  66  were  selected  as  examples  because  Item  65  is  quite  easy  and 
Item  66  is  quite  difficult.  Moreover,  Item  66  is  one  of  the  items  that  was  flagged  by  POLY  as 
needing  special  attention.  Below  the  item-number  for  each  item,  a  25-character  item-identification 
string  is  printed  in  parentheses.  The  user  specifies  this  string  for  each  item  in  the  analysis.  In 
Figure  3,  both  items  came  from  Word  Knowledge  (WK)  test-booklet  number  1. 

The  columns  headed  “CAT.”  in  Figure  3  contain  response-category  identification  numbers. 
Category  0  is  a  pseudo-category  that  corresponds  to  “Not  Administered.”  If  an  examinee's  data- 
record  contains  a  zero  response-code  for  a  given  item,  POLY  does  not  use  that  item  in  computing 
the  examinee's  polyscore.  In  POLY,  eight  categories  are  available  as  scored  categories.  The  user 
must  indicate  the  number  of  categories  that  are  present  for  each  item  in  the  analysis.  The  number 
of  categories  can  vary  from  item  to  item.  In  the  examples  in  Figure  3,  Categories  1  through  5 
correspond to choices “A” through “E” in these 5-alternative multiple-choice items.

For  the  items  in  Figure  3,  Category  6  corresponds  to  “Omit.”  For  each  item,  a  response-code  of 
6  was  entered  in  the  examinee  data-record  if  the  examinee  did  not  answer  the  item,  but  he/she 
answered  at  least  one  item  that  appeared  later  in  the  same  test  booklet.  Category  7  corresponds  to 
“Not  Reached.”  A  response-code  of  7  was  entered  in  an  examinee's  data-record  if  the  examinee  did 
not  answer  a  given  item,  and  he/she  did  not  answer  any  subsequent  items  in  the  same  test  booklet. 
This  use  of  Categories  6  and  7  is  specific  to  our  example.  During  a  POLY  run,  the  last  two  categories 
for  an  item  are  treated  no  differently  than  the  other  response  categories  (except  Category  0). 

In Figure 3, the last value that appears in the columns headed “CAT.” identifies a composite
pseudo-category that collapses all actual response categories into one. In the examples, this
pseudo-category is labeled “1-7” because Items 65 and 66 were each specified to have seven
categories.  Summary  statistics  derived  from  all  examinees  who  were  administered  a  given  item 
(i.e.,  from  all  categories  other  than  Category  0)  are  associated  with  this  pseudo-category. 

The entries in the columns headed “FREQ.” in Figure 3 indicate how many examinees were
associated  with  each  response  category.  The  frequency  shown  for  Category  0  is  the  number  of 
examinees  who  were  not  administered  the  item  (N  =  5,312  for  Items  65  and  66).  The  frequency 
shown  for  the  composite  pseudo-category  (Category  1-7  in  the  example)  is  the  number  of 
examinees  who  were  administered  the  item  (N  =  2,829  for  Items  65  and  66).  The  other  frequencies 
in  this  column  correspond  to  the  categories  indicated  in  column  1.  The  double-asterisk  (**)  that 
appears  between  columns  1  and  2  identifies  the  keyed  (correct)  response  for  each  item. 

In Figure 3, the entries in the columns headed “PROP.” indicate the proportion of the examinee
sample that was associated with each response category. In these columns, the proportions are
based on the entire examinee sample (N = 8,141 in the example). The entries in the columns headed
“ADJ. PROP.” (column 7) are adjusted proportions. These proportions are based on just the
examinees who were actually administered an item. Thus, for Item 65, .3414 of the examinee
sample gave the keyed response (Category 3, or “C”). However, since only .3475 of the sample
was  administered  the  item,  the  adjusted  proportion  for  Category  3  is  .9823,  indicating  that  Item  65 
was  quite  easy.  This  may  be  contrasted  with  the  adjusted  proportion  for  the  correct  answer  to 
Item  66  (.1421),  which  indicates  that  Item  66  was  quite  difficult. 
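
The two kinds of proportion follow directly from the frequencies quoted above. A small sketch (the function name is hypothetical) that reproduces the Item 65 and Item 66 figures:

```python
def category_proportions(freq, n_administered, n_sample):
    """Return (proportion of whole sample, proportion of those administered)."""
    return freq / n_sample, freq / n_administered


# Item 65, keyed response: 2,779 of the 2,829 examinees administered the item.
prop, adj = category_proportions(2779, 2829, 8141)
print(round(prop, 4), round(adj, 4))

# Item 66, keyed response: 402 of the same 2,829 examinees.
print(round(category_proportions(402, 2829, 8141)[1], 4))
```

The unadjusted proportion mixes item difficulty with the share of the sample that saw the item, which is why difficulty should be read from the adjusted column.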

The columns headed “%-ILE MEAN” and “%-ILE S.D.” in Figure 3 give the means and
standard  deviations  of  percentile  rank  scores  among  examinees  associated  with  each  response 
category.  Means  and  standard  deviations  are  computed  for  the  pseudo-categories  (Categories  0 
and,  in  this  example,  1-7)  so  that  the  user  can  check  for  obvious  violations  of  the  requirement  that 
each  item  be  administered  to  a  random  sample  from  the  examinee  population.  For  each  item,  the 
means  and  standard  deviations  for  the  two  pseudo-categories  should  be  similar.  If  they  are  not,  it 
suggests  that  randomly  equivalent  item-set  groups  were  not  achieved. 

In  the  example,  the  mean  percentile  rank  among  the  2,779  examinees  selecting  the  correct 
response  on  Item  65  was  50.90.  The  mean  percentile  rank  among  individuals  choosing  a  wrong 
answer  on  this  item  ranged  from  a  high  of  12.13  among  the  8  individuals  who  chose  Category  4 
(“D”)  to  a  low  of  1.13  among  the  7  individuals  who  did  not  reach  the  item  (Category  7). 

The  mean  percentile  rank  among  the  402  examinees  who  selected  the  correct  response  on 
Item  66  was  61.20.  The  mean  percentile  rank  among  individuals  choosing  a  wrong  answer  on 
Item 66 ranged from a high of 53.61 among the 1,787 individuals who chose Category 5 (“E”), to
a  low  of  2.61  among  the  8  individuals  who  did  not  reach  the  item  (Category  7).  Most  of  the 
category  means  for  wrong-answers  are  substantially  higher  for  Item  66  than  for  Item  65. 

Final (iteration 6) polyweights for Items 65 and 66 are shown in the columns labeled “Scoring
Weight” in Figure 3. For the keyed response categories, and for wrong answers selected by 100 or
more examinees, these weights are the category means from iteration 5, rounded to the nearest
integer. For wrong-answer categories selected by fewer than 100 examinees, the scoring weights
were obtained by inserting percentile means from iteration 5 into Equation 1, and rounding the
resulting  values  to  the  nearest  integer. 

For both example items, none of the wrong-answer scoring weights exceeded the polyweight
for  the  item's  keyed  response,  so  none  of  the  wrong-answer  weights  had  to  be  bounded.  If  any 
weight  had  been  bounded  (set  equal  to  1  less  than  the  polyweight  for  the  keyed  answer),  a  single 
asterisk  (*)  would  have  appeared  to  the  right  of  the  bounded  weight. 

Consideration  of  the  scoring  weights  shown  in  Figure  3  gives  an  indication  of  the  impact  of 
polyweighting.  An  examinee  who  answers  Item  66  correctly  will  receive  more  credit  than  an 
examinee  who  answers  Item  65  correctly  (61  vs.  51).  Conversely,  an  examinee  who  answers 
Item  65  incorrectly  will  be  penalized  more  heavily  than  an  examinee  who  answers  Item  66 
incorrectly  (a  score  of  9  or  less  if  Item  65  is  answered  incorrectly  vs.  a  score  of  at  least  25  if  Item  66 
is  answered  incorrectly). 

The  columns  headed  “R(C,%)”  in  Figure  3  contain  point-biserial  correlations  between 
individual  response  categories  and  percentile  rank  scores.  In  the  case  of  Item  65,  there  is  only  one 
positive  point-biserial,  the  one  for  the  keyed  answer.  In  the  case  of  Item  66,  there  are  two  positive 
point-biserials,  one  associated  with  the  keyed  answer,  and  one  associated  with  Category  5  (“E”). 
In  fact,  the  point-biserial  for  Category  5  is  slightly  higher  than  the  point-biserial  for  the  keyed 
answer. This is why Item 66 was mentioned in the diagnostic output shown in Figure 2.

In  Figure  3,  the  last  summary  statistic  printed  for  each  item  is  labeled  “ETA(I,%).”  This  is  the 
eta (η) coefficient for the item. For Item 65, the η coefficient is only slightly larger than the
point-biserial for the keyed response (.1972 vs. .1963), which indicates that polychotomous scoring of
this item will add very little to measurement precision. This is not unexpected, since Item 65 is very
easy and few wrong answers are observed. For Item 66, the η coefficient is substantially larger than
the  point-biserial  for  the  keyed  response  (.3437  vs.  .1559),  indicating  that  polychotomous  scoring 
of  this  item  will  provide  useful  additional  information  about  an  individual's  percentile  rank  within 
the  examinee  population. 

An  item  such  as  Item  66  would  often  be  discarded  in  a  traditional  item  analysis  because  of  the 
relatively  large  positive  point-biserial  correlation  for  Category  5,  a  wrong  answer.  However,  by 
using polyweighting, this apparently bad item can be retained and used to gather useful information
about  examinee  ability.  As  indicated  by  the  percentile  mean  for  Category  5,  examinees  who  select 
this  category  are,  on  the  average,  of  higher  ability  than  examinees  who  select  the  other  wrong 
answers.  This  fact  is  taken  into  account  when  the  item  is  scored  using  the  polyweights  shown  in 
Figure  3. 

One  would  not  want  to  use  Item  66  in  a  test  without  further  investigation  of  its  psychometric 
properties  and  its  content.  To  aid  in  this  process,  POLY  can  generate  an  “Endorsement-Rate 
Tables”  file.  The  endorsement-rate  table  for  Item  66  is  shown  in  Figure  4.  In  order  to  create  tables 
like the one shown in Figure 4, POLY can divide the examinee sample into as many as 100 ability
groups,  based  on  their  percentile  ranks.  Then,  using  only  the  examinees  who  were  administered  a 
particular  item,  the  proportion  of  examinees  who  gave  each  response  is  computed  within  each 
group.  In  the  example,  POLY  was  instructed  to  form  50  ability  groups. 
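
A table of this kind can be sketched as follows. This is an illustration rather than POLY's own grouping logic: it assumes that ability groups are equal-width bands of percentile rank (band 20 of 50 covering ranks above 38 up to 40, for instance), and the function name and parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict


def endorsement_rates(choices, ranks, n_groups=50, n_categories=7):
    """Build one row per non-empty ability group.

    choices: response category per examinee (0 = not administered)
    ranks:   percentile rank per examinee
    Each row is (group, n, mean percentile rank, [proportion per category]).
    """
    width = 100.0 / n_groups
    members = defaultdict(list)
    for c, r in zip(choices, ranks):
        if c == 0:
            continue  # only examinees who were administered the item
        # ASSUMPTION: equal-width percentile bands numbered 1..n_groups.
        g = min(n_groups, max(1, math.ceil(r / width)))
        members[g].append((c, r))
    rows = []
    for g in sorted(members):
        pairs = members[g]
        n = len(pairs)
        counts = Counter(c for c, _ in pairs)
        mean_rank = sum(r for _, r in pairs) / n
        rates = [counts.get(k, 0) / n for k in range(1, n_categories + 1)]
        rows.append((g, n, mean_rank, rates))
    return rows


# Toy call with two ability groups and two categories.
for row in endorsement_rates([1, 2, 2, 0, 1], [10, 12, 55, 99, 60],
                             n_groups=2, n_categories=2):
    print(row)
```

Within each row the category proportions sum to 1, since every examinee who was administered the item falls into exactly one category.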

In  Figure  4,  the  mean  percentile  rank  within  each  ability  group  is  shown  in  the  column  labeled 
“%-ILE.” The columns labeled “CAT1” through “CAT7” contain the proportions selecting each
category,  within  each  ability  group.  It  is  notable  that  the  proportion  selecting  the  correct  answer 
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(Category  4)  is  below  chance  level  (i.e.,  below  .20)  over  virtually  all  of  the  ability  range  from  the 
1st  to  the  89th  percentiles.  Only  in  the  top  decile  of  this  examinee  sample  does  the  proportion  of 
examinees  that  select  the  keyed  answer  start  increasing,  with  the  highest  ability  group  (the  top  2%) 
selecting  the  keyed  answer  about  87%  of  the  time. 
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0.01818 

0.0S4SS 

0.80000 

0. 

0. 

20 

55 

38.97 

0.14545 

0.05455 

0.03636 

0.12727 

0.63636 

0. 

0. 

21 

58 

40.97 

0.12069 

0.05172 

0.06897 

0.08621 

0.65517 

0.01724 

0. 

22 

57 

43.00 

0.12281 

0.03509 

0.01754 

0.05263 

0.75439 

0.01754 

0. 

23 

57 

45.03 

0.08772 

0.05263 

0.10526 

0.08772 

0.66667 

0. 

0. 

24 

56 

47.02 

0.12500 

0.03571 

0.07143 

0.05357 

0.71429 

0. 

0. 

25 

S3 

48.96 

0.14545 

0. 

0.05455 

0.07273 

0.70909 

0.01818 

0. 

26 

57 

50.95 

0.07018 

0.03509 

0.03509 

0.05263 

0.80702 

0. 

0. 

27 

57 

52.98 

0.12281 

0.01754 

0.07018 

0.08772 

0.70175 

0. 

0. 

26 

58 

54.99 

0.12069 

0.01724 

0.03448 

0.12069 

0.70690 

0. 

0. 

29 

55 

57.01 

0.12727 

0.01818 

0. 

0.07273 

0.78182 

0. 

0. 

30 

56 

58.95 

0.19643 

0.01786 

0. 

0.03571 

0.75000 

0. 

0. 

31 

55 

60.96 

0.10909 

0. 

0.03636 

0.07273 

0.78182 

0. 

0. 

32 

57 

62.92 

0.14035 

0.05263 

0. 

0.01754 

0.78947 

0. 

0. 

33 

59 

64.94 

0.08475 

0.01695 

0.05085 

0.01695 

0.83051 

0. 

0. 

34 

57 

66.99 

0.21053 

0.01754 

0.05263 

0.07018 

0.64912 

0. 

0. 

35 

57 

69.01 

0.08772 

0.03509 

0.01754 

0.08772 

0.73684 

0.03509 

0. 

36 

56 

70.99 

0.01786 

0. 

0.01786 

0.10714 

0.85714 

0. 

0. 

37 

55 

73.02 

0.09091 

0. 

0.07273 

0.12727 

0.70909 

0. 

0. 

38 

58 

74.98 

0.10345 

0. 

0.01724 

0.08621 

0.79310 

0. 

0. 

*9 

54 

76.96 

0.07407 

0.03704 

0.03704 

0.03704 

0.81481 

0. 

0. 

.0 

59 

78.96 

0.13559 

0. 

0.01695 

0.05085 

0.77966 

0.01695 

0. 

41 

55 

80.96 

0.03636 

0. 

0. 

0.10909 

0.85455 

0. 

0. 

42 

60 

82.99 

0.05000 

0. 

0.01667 

0.08333 

0.85000 

0. 

0. 

43 

55 

85.05 

0.07273 

0.01818 

0.01818 

0.05455 

0.83636 

0. 

0. 

44 

56 

86.99 

0.01786 

0. 

0.01786 

0.16071 

0.80357 

0. 

0. 

45 

58 

89.00 

0.06897 

0. 

0. 

0,15517 

0.77586 

0. 

0. 

46 

56 

91.02 

0.10714 

0. 

0. 

0.28571 

0.60714 

0. 

0. 

47 

53 

92.94 

0.03774 

0. 

0.03774 

0.39623 

0.52830 

0. 

0. 

48 

60 

94.95 

0. 

0. 

0.05000 

0.41667 

0.53333 

0. 

0. 

49 

56 

97.00 

0.01786 

0. 

0.01786 

0.71429 

0.25000 

0. 

0. 

50 

61 

99.05 

0. 

0. 

0. 

0.86885 

0.13115 

0. 

0. 

Figure 4. Endorsement-rate table for Item 66.

In Figure 4, the endorsement rates for wrong answers usually decline as ability level increases,
though the rate of decline varies between response categories. A pronounced exception to this
pattern is observed for Category 5, where the endorsement rate goes from chance level, among the
lowest ability groups, to over .80 within ability groups near the 75th percentile. In the top decile of
the examinee sample, the endorsement rate for Category 5 finally starts dropping, declining to
about .13 in the highest ability group.

To aid interpretation of the endorsement-rate table shown in Figure 4, Figures 5 through 11
show graphic plots of the category endorsement rates for Item 66. The computer program POLY
does not provide this type of plot, but such plots can be generated using the endorsement-rate tables
that are available from POLY. The plots in Figures 5 through 11 also show fitted functions that
smooth and interpolate the plotted endorsement rates. It is clear from Figures 10 and 11 that few
examinees omit or do not reach Item 66 (Categories 6 and 7). Most examinees select Category 5
(Figure 9) and only the most able examinees select Category 4 (Figure 8) with greater than chance
frequency.
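
The form of the fitted functions is not specified here. As one possible way to smooth and interpolate endorsement rates taken from an endorsement-rate table, the sketch below uses a Gaussian kernel-weighted average. The choice of smoother, the bandwidth, and the function name are assumptions for illustration, not the method used to produce the figures. The input values are the Category 4 rates for the six highest ability groups in Figure 4.

```python
import math

def kernel_smooth(xs, ys, grid, bandwidth=0.05):
    """Gaussian kernel-weighted average of ys, evaluated at each grid point."""
    smoothed = []
    for g in grid:
        weights = [math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in xs]
        smoothed.append(sum(w * y for w, y in zip(weights, ys)) / sum(weights))
    return smoothed

# Percentile/100 and Category 4 endorsement rates, ability groups 45 to 50.
xs = [0.8900, 0.9102, 0.9294, 0.9495, 0.9700, 0.9905]
ys = [0.15517, 0.28571, 0.39623, 0.41667, 0.71429, 0.86885]
grid = [0.89 + 0.01 * i for i in range(11)]  # 0.89, 0.90, ..., 0.99
curve = kernel_smooth(xs, ys, grid)
```

A smaller bandwidth follows the plotted rates more closely; a larger one gives a smoother curve but flattens the sharp rise in the top decile.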


Figure 5. Plot of Category 1 endorsement rates for Item 66. (Axes: PROBABILITY by PERCENTILE/100.)

Figure 6. Plot of Category 2 endorsement rates for Item 66. (Axes: PROBABILITY by PERCENTILE/100.)

Figure 7. Plot of Category 3 endorsement rates for Item 66. (Axes: PROBABILITY by PERCENTILE/100.)

The most important step in the evaluation of an unusual item is a careful inspection of its
content. Some information about Item 66 may be of interest. First, as mentioned previously,
Item 66 is a 5-alternative multiple-choice WK item. This item asks the examinee to select the best
synonym for "respite." The keyed answer is "rest" (Category 4). Category 5, the very popular
wrong answer, is "grudge." It appears that all but the most knowledgeable examinees were fooled
by the presence of the word "spite" within the item stem. Further inspection of Item 66 gave no
indication of a problem with the content of the item, so its use in a polychotomously-scored test
seems appropriate.

The Primary Output file generated by POLY also contains summary statistics for each item-set
administered. An example is shown in Figure 12. As mentioned earlier, there were 18 item-sets
administered in this item calibration example. Figure 12 shows summary statistics for one of these
item-sets. Statistics provided by POLY include the number of examinees who were administered
the item-set, the number of items in the item-set, the mean and standard deviation of raw polyscores
for the item-set, the minimum raw/standardized polyscore observed, the maximum raw/
standardized polyscore observed, and coefficient α for the item-set. In addition to these summary
statistics, a table giving the mean raw polyscore and the mean standardized polyscore for each of
25 equal-frequency score groups is printed. This table allows the user to gain an impression of the
shape of the distribution of raw/standardized polyscores for each item-set. In Figure 12, it is clear
that the distribution of raw/standardized polyscores was skewed left for item-set 1.
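
The two kinds of summary described above, coefficient α and the table of score-group means, can be sketched as follows. This is an illustration only. The function names are hypothetical, and the sketch assumes population (divide-by-N) variances and group boundaries obtained by integer slicing of the sorted scores, which may differ from POLY's conventions.

```python
from statistics import mean, pstdev, pvariance

def coefficient_alpha(item_scores):
    """item_scores: one list of per-item scores for each examinee."""
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    return k / (k - 1) * (1 - sum(item_vars) / pvariance(totals))

def score_group_means(raw_scores, n_groups=25):
    """Rows of (group, n, mean raw score, mean standardized score)."""
    ordered = sorted(raw_scores)
    m, s = mean(ordered), pstdev(ordered)
    n = len(ordered)
    rows = []
    for g in range(n_groups):
        # Integer boundaries give group sizes that differ by at most one.
        lo, hi = g * n // n_groups, (g + 1) * n // n_groups
        group = ordered[lo:hi]
        if group:
            gm = mean(group)
            rows.append((g + 1, len(group), gm, (gm - m) / s))
    return rows

# Hypothetical toy data: four examinees, two perfectly consistent items.
alpha = coefficient_alpha([[1, 1], [0, 0], [1, 1], [0, 0]])
groups = score_group_means([1, 2, 3, 4], n_groups=2)
```

Because standardizing is a linear transformation, the mean standardized score of a group equals the standardized value of its mean raw score, so both columns of the table describe the same distribution shape.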

Conclusion 

This  concludes  our  discussion  of  example  output  from  the  program  POLY.  The  example 
demonstrates  that  polyweighting  can  be  used  to  calibrate  large  item-pools  in  which  different 
examinees  have  been  administered  different  test  questions.  Until  now,  it  was  necessary  to  adopt  the 
assumptions  of  IRT  in  order  to  analyze  this  type  of  data.  Such  assumptions  are  no  longer  necessary. 

The example also demonstrates that polyweighting increases the internal-consistency
reliability of the item-sets (tests) to which it is applied. Available research (Sympson & Davison,
1993; Sympson & Haladyna, 1993) demonstrates that such reliability increases hold up well in new
samples of examinees from the same population.

WORD KNOWLEDGE JOINT CALIBRATION, N = 8,141

SUMMARY STATISTICS FOR ITEM-SET 1

NUMBER OF EXAMINEES = 507
NUMBER OF ITEMS = 121

RAW SCORE MEAN = 50.00
RAW SCORE STANDARD DEVIATION = 4.36

MINIMUM RAW SCORE = 33.11
MINIMUM STANDARD SCORE = -3.8785   (CASE 5088)

MAXIMUM RAW SCORE = 56.39
MAXIMUM STANDARD SCORE = 1.4671   (CASE 358)

ALPHA = 0.95613

MEAN SCORES FOR ORDERED SCORE-GROUPS

GROUP    N   RAW SC.   STD SC.
    1   21     37.92   -2.7731
    2   21     42.41   -1.7417
    3   21     44.30   -1.3094
    4   21     45.55   -1.0208
    5   21     46.56   -0.7902
    6   21     47.35   -0.6080
    7   21     48.00   -0.4589
    8   20     48.50   -0.3450
    9   20     48.83   -0.2695
   10   20     49.29   -0.1621
   11   20     49.87   -0.0290
   12   20     50.36    0.0840
   13   20     50.83    0.1915
   14   20     51.22    0.2801
   15   20     51.55    0.3562
   16   20     52.03    0.4663
   17   20     52.37    0.5450
   18   20     52.74    0.6294
   19   20     53.13    0.7186
   20   20     53.72    0.8549
   21   20     54.16    0.9560
   22   20     54.60    1.0561
   23   20     55.00    1.1480
   24   20     55.51    1.2656
   25   20     56.06    1.3910

Figure 12. Summary statistics for Item-set 1.